Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for ZCMT Extension for Code-Size Reduction in CVA6 #2659

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

farhan-108
Copy link

@farhan-108 farhan-108 commented Dec 10, 2024

Introduction

This PR implements the ZCMT extension in the CVA6 core, targeting the 32-bit embedded-class platforms. ZCMT is a code-size reduction feature that utilizes compressed table jump instructions (cm.jt and cm.jalt) to reduce code size for embedded systems
Note: Due to implementation complexity, ZCMT extension is primarily targeted at embedded class CPUs. Additionally, it is not compatible with architecture class profiles.(Ref. Unprivilege spec 27.20)

Key additions

  • Added zcmt_decoder module for compressed table jump instructions: cm.jt (jump table) and cm.jalt (jump-and-link table)

  • Implemented the Jump Vector Table (JVT) CSR to store the base address of the jump table in csr_reg module

  • Implemented a return address stack, enabling cm.jalt to behave equivalently to jal ra (jump-and-link with return address), by pushing the return address onto the stack in zcmt_decoder module

Implementation in CVA6

The implementation of the ZCMT extension involves the following major modifications:

compressed decoder

The compressed decoder scans and identifies the cm.jt and cm.jalt instructions, and generates signals indicating that the instruction is both compressed and a ZCMT instruction.

zcmt_decoder

A new zcmt_decoder module was introduced to decode the cm.jt and cm.jalt instructions, fetch the base address of the JVT table from JVT CSR, extract the index and construct jump instructions to ensure efficient integration of the ZCMT extension in embedded platforms. Table.1 shows the IO port connection of zcmt_decoder module. High-level block diagram of zcmt implementation in CVA6 is shown in Figure 1.

Table. 1 IO port connection with zcmt_decoder module

Signals IO Description Connection Type
clk_i in Subsystem Clock SUBSYSTEM logic
rst_ni in Asynchronous reset active low SUBSYSTEM logic
instr_i in Instruction in compressed_decoder logic [31:0]
pc_i in Current PC PC from FRONTEND logic [CVA6Cfg.VLEN-1:0]
is_zcmt_instr_i in Is instruction a zcmt instruction compressed_decoder logic
illegal_instr_i in Is instruction a illegal instruction compressed_decoder logic
is_compressed_i in Is instruction a compressed instruction compressed_decoder logic
jvt_i in JVT struct from CSR CSR jvt_t
req_port_i in Handshake between CACHE and FRONTEND (fetch) Cache dcache_req_o_t
instr_o out Instruction out cvxif_compressed_if_driver logic [31:0]
illegal_instr_o out Is the instruction is illegal cvxif_compressed_if_driver logic
is_compressed_o out Is the instruction is compressed cvxif_compressed_if_driver logic
fetch_stall_o out Stall siganl cvxif_compressed_if_driver logic
req_port_o out Handshake between CACHE and FRONTEND (fetch) Cache dcache_req_i_t

branch unit condition

A condition is implemented in the branch unit to ensure that ZCMT instructions always cause a misprediction, forcing the program to jump to the calculated address of the newly constructed jump instruction.

JVT CSR

A new JVT csr is implemented in csr_reg which holds the base address of the JVT table. The base address is fetched from the JVT CSR, and combined with the index value to calculate the effective address.

No MMU

Embedded platform does not utilize the MMU, so zcmt_decoder is connected with cache through port 0 of the Dcache module for implicit read access from the memory.

zcmt_block drawio
Figure. 1 High level block diagram of ZCMT extension implementation

Known Limitations

The implementation targets 32-bit instructions for embedded-class platforms without an MMU. Since the core does not utilize an MMU, it is leveraged to connect the zcmt_decoder to the cache via port 0.

Testing and Verification

  • Developed directed test cases to validate cm.jt and cm.jalt instruction functionality
  • Verified correct initialization and updates of JVT CSR

Test Plan

A test plan is developed to test the functionality of ZCMT extension along with JVT CSR. Directed Assembly test executed to check the functionality.

Table. 2 Test plan

S.no Features Description Pass/Fail Criteria Test Type Test status
1 cm.jt Simple assembly test to validate the working of cm.jt instruction in  CV32A60x. Check against Spike's ref. model Directed Pass
2 cm.jalt Simple assembly test to validate the working of cm.jalt instruction in both CV32A60x. Check against Spike's ref. model Directed Pass
3 cm.jalt with return address stack Simple assembly test to validate the working of cm.jalt instruction with return address stack in both CV32A60x. It works as jump and link ( j ra, imm) Check against Spike's ref. model Directed Pass
4 JVT CSR Read and write base address of Jump table to JVT CSR Check against Spike's ref. model Directed Pass

Note: Please find the test under CVA6_REPO_DIR/verif/tests/custom/zcmt"

@JeanRochCoulon
Copy link
Contributor

@yanicasa

@JeanRochCoulon
Copy link
Contributor

@ASintzoff

Copy link
Contributor

❌ failed run, report available here.

@fatimasaleem
Copy link
Contributor

fatimasaleem commented Dec 10, 2024

Hi @JeanRochCoulon how can we know which line in the code this spyglass failure refers to? Some inputs to instance are not driven or unconnected

@JeanRochCoulon
Copy link
Contributor

@ASintzoff do you know how to help ?

Copy link
Contributor

❌ failed run, report available here.

2 similar comments
Copy link
Contributor

❌ failed run, report available here.

Copy link
Contributor

❌ failed run, report available here.

core/id_stage.sv Outdated Show resolved Hide resolved
core/id_stage.sv Outdated Show resolved Hide resolved
Copy link
Contributor

❌ failed run, report available here.

2 similar comments
Copy link
Contributor

❌ failed run, report available here.

Copy link
Contributor

❌ failed run, report available here.

core/compressed_decoder.sv Outdated Show resolved Hide resolved
core/id_stage.sv Outdated Show resolved Hide resolved
Copy link
Contributor

❌ failed run, report available here.

2 similar comments
Copy link
Contributor

❌ failed run, report available here.

Copy link
Contributor

❌ failed run, report available here.

Copy link
Contributor

❌ failed run, report available here.

@fatimasaleem
Copy link
Contributor

@ASintzoff do you know how to help ?

@JeanRochCoulon
@farhan-108 found the issue and resolved it. Now only failure is due to the gate count increase.

@ASintzoff
Copy link
Contributor

@ASintzoff do you know how to help ?

@JeanRochCoulon @farhan-108 found the issue and resolved it. Now only failure is due to the gate count increase.

is the extension completely optional? All the RTL added for Zcmt should be removed when the extension is not set. If not, it can increase the gate count.

core/csr_regfile.sv Outdated Show resolved Hide resolved
core/csr_regfile.sv Outdated Show resolved Hide resolved
core/csr_regfile.sv Outdated Show resolved Hide resolved
core/csr_regfile.sv Outdated Show resolved Hide resolved
core/csr_regfile.sv Outdated Show resolved Hide resolved
core/csr_regfile.sv Outdated Show resolved Hide resolved
core/csr_regfile.sv Outdated Show resolved Hide resolved
core/issue_read_operands.sv Outdated Show resolved Hide resolved
Copy link
Contributor

❌ failed run, report available here.

Copy link
Contributor

❌ failed run, report available here.

core/include/cv32a6_ima_sv32_fpga_config_pkg.sv Outdated Show resolved Hide resolved
core/include/cv32a6_imac_sv0_config_pkg.sv Outdated Show resolved Hide resolved
core/include/cv32a6_imac_sv32_config_pkg.sv Outdated Show resolved Hide resolved
core/include/cv32a6_imafc_sv32_config_pkg.sv Outdated Show resolved Hide resolved
core/include/cv64a6_imafdc_sv39_config_pkg.sv Outdated Show resolved Hide resolved
core/include/cv64a6_imafdc_sv39_hpdcache_config_pkg.sv Outdated Show resolved Hide resolved
core/include/cv64a6_imafdc_sv39_openpiton_config_pkg.sv Outdated Show resolved Hide resolved
core/include/cv64a6_imafdc_sv39_wb_config_pkg.sv Outdated Show resolved Hide resolved
core/include/cv64a6_imafdch_sv39_config_pkg.sv Outdated Show resolved Hide resolved
core/include/cv64a6_imafdch_sv39_wb_config_pkg.sv Outdated Show resolved Hide resolved
);
end
if (CVA6Cfg.RVZCMP) begin
if (CVA6Cfg.RVZCMP || (CVA6Cfg.RVZCMT & ~CVA6Cfg.MmuPresent)) begin //MMU should be off when using ZCMT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet 2 cents that the gate count increase comes from this condition: decoder_macro and zcmt_decoder are both inferred by the same if condition. To me, decoder_macro depends on zcmp and zcmt_decoder depends on zcmt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JeanRochCoulon almost ~1k gate count is increased in issue_stage, which I think is because we added one signal in scoreboard struct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we increase the expected_synth gate count. What do you think, @JeanRochCoulon?

Copy link
Contributor

❌ failed run, report available here.

Copy link
Contributor

❌ failed run, report available here.

Copy link
Contributor

❌ failed run, report available here.

Copy link
Contributor

@cathales cathales left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution. Here are some suggestions.

By the way, do you think it would be feasible to make it compatible with superscalar?

Also, there is still the area increase which is really impacting even when it is disabled at elaboration time.

core/branch_unit.sv Outdated Show resolved Hide resolved
core/ex_stage.sv Outdated Show resolved Hide resolved
core/ex_stage.sv Outdated Show resolved Hide resolved
core/decoder.sv Outdated Show resolved Hide resolved
core/include/config_pkg.sv Outdated Show resolved Hide resolved
core/issue_stage.sv Outdated Show resolved Hide resolved
core/zcmt_decoder.sv Show resolved Hide resolved
Comment on lines +14 to +28
input logic clk_i, // Clock
input logic rst_ni, // Synchronous reset
input logic [31:0] instr_i, // instruction
input logic [CVA6Cfg.VLEN-1:0] pc_i, // PC
input logic is_zcmt_instr_i, // Intruction is of macro extension
input logic illegal_instr_i, // From compressed decoder
input logic is_compressed_i, // is compressed instruction
input jvt_t jvt_i,
input dcache_req_o_t req_port_i, // Data cache request ouput - CACHE

output logic [31:0] instr_o, // Instruction out
output logic illegal_instr_o, // Illegel instruction
output logic is_compressed_o, // is compressed instruction
output logic fetch_stall_o, // Wait while address fetched from table
output dcache_req_i_t req_port_o // Data cache request input - CACHE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most port documentations here miss the "connexion" field, used to produce tables such as https://cva6.readthedocs.io/en/latest/04_cv32a65x/design/design.html#frontend-description

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cathales I have updated the description and added the IO port connection table.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot see it as of 2124fa6.

The expected format is:

  • A line with only a comment, before the port declaration
  • In the comment, the port description followed by " - " and the connection

For instance:

// Data cache request input - CACHE
output dcache_req_i_t req_port_o

Not:

output dcache_req_i_t req_port_o // Data cache request input - CACHE

Nor:

// Data cache request input
output dcache_req_i_t req_port_o

jump_addr[10:1],
jump_addr[11],
jump_addr[19:12],
5'h1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved_branch does not currently update RAS in frontend.
So this instruction does not update RAS, does it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cathales The resolved_branch signal is updated in the branch unit, ensuring that the RAS is also updated and enabling jump on calculated address instead of predicted address.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, resolved_branch has no influence on the RAS as of f4355fa

cva6/core/frontend/frontend.sv at f4355fa49b6ed592579146dbfc65a34f3bbd6ed2 · openhwgroup/cva6

Instead, the frontend could be modified:

  • intsr_scan could detect that it is a call
  • this call detection would rise ras_push
  • this ras_push signal would push the correct address into the RAS

end
TABLE_JUMP: begin
if (req_port_i.data_rvalid) begin
jump_addr = $unsigned($signed(req_port_i.data_rdata) - $signed(pc_i));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work if the target address is too far from the current program counter.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cathales This mechanism supports jumps within a range of ±1 MB from the current program counter, as the addr_jump field uses 20 bits as an offset immediate from the PC, it is similar to the JAL instruction in 32-bit RISC-V.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this limitation, this implementation is not compliant with Zcmt, is it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cathales according to ZCMT spec cm.jt and cm.jalt is equivalent to
• 32-bit j calls
• 32-bit jal ra calls
and J / jal ra calls only jump to ±1 MB
Thus implementation is complaint with zcmt as it is only for 32 bits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At https://github.com/riscv/riscv-isa-manual/blob/main/src/zc.adoc#table-jump-overview RISC-V specification states

cm.jt and cm.jalt encodings index the table, giving access to functions within the full XLEN wide address space.

Therefore, the target addresses shall be any address and not only addresses near the current PC.

@farhan-108
Copy link
Author

farhan-108 commented Dec 17, 2024

Thank you for your contribution. Here are some suggestions.

By the way, do you think it would be feasible to make it compatible with superscalar?

Also, there is still the area increase which is really impacting even when it is disabled at elaboration time.

@cathales Currently, it is not compatible with a superscalar architecture because it stalls the pipeline in the decode stage until the implicit fetch is completed. Compatibility can be addressed in a future update alongside the ZCMP extension.

@@ -48,6 +48,8 @@ module decoder
input logic is_last_macro_instr_i,
// Is mvsa01/mva01s macro instruction - macro_decoder
input logic is_double_rd_macro_instr_i,
//zcmt instruction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//zcmt instruction
// Zcmt instruction - FRONTEND

@@ -373,6 +376,7 @@ package config_pkg;
assert (Cfg.NrPMPEntries <= 64);
assert (!(Cfg.SuperscalarEn && Cfg.RVF));
assert (!(Cfg.SuperscalarEn && Cfg.RVZCMP));
assert (!(Cfg.SuperscalarEn && Cfg.RVZCMT && ~CVA6Cfg.MmuPresent));
Copy link
Contributor

@cathales cathales Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cfg.SuperscalarEn && Cfg.RVZCMT won’t trigger this assertion. You could split into two distinct assertions:

Suggested change
assert (!(Cfg.SuperscalarEn && Cfg.RVZCMT && ~CVA6Cfg.MmuPresent));
assert (!(Cfg.SuperscalarEn && Cfg.RVZCMT));
assert (!(Cfg.RVZCMT && ~Cfg.MmuPresent));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the jobs fail because it should be Cfg instead of CVA6Cfg here.

Comment on lines +93 to +94
resolved_branch_o.is_mispredict = 1'b1; // miss prediction for ZCMT
resolved_branch_o.cf_type = ariane_pkg::Jump;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could send a JumpR instead of Jump. It would update BTB in the frontend to correctly predict the destination next time.
We could then calculate is_mispredict accordingly, to use this prediction when we later encounter the same cm.jt or cm.jalt.

end
TABLE_JUMP: begin
if (req_port_i.data_rvalid) begin
jump_addr = $unsigned($signed(req_port_i.data_rdata) - $signed(pc_i));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this limitation, this implementation is not compliant with Zcmt, is it?

Copy link
Contributor

❌ failed run, report available here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants